在过去的几年中,基于变压器的预训练的语言模型在行业和学术界都取得了惊人的成功。但是,较大的模型尺寸和高运行时间延迟是在实践中应用它们的严重障碍,尤其是在手机和物联网(IoT)设备上。为了压缩该模型,最近有大量文献围绕知识蒸馏(KD)的主题长大。然而,KD在基于变压器的模型中的工作方式仍不清楚。我们取消了KD的组件,并提出了一个统一的KD框架。通过框架,花费了23,000多个GPU小时的系统和广泛的实验,从知识类型的角度,匹配策略,宽度深度折衷,初始化,型号大小等。在培训前语言模型中,对先前最新的(SOTA)的相对显着改善。最后,我们为基于变压器模型的KD提供了最佳实践指南。
translated by 谷歌翻译
实现通用语言情报是自然语言处理的长期目标,标准评估基准发挥基本和指导作用。我们认为,对于通用语言智能评估,基准本身需要全面和系统。为此,我们提出了Cuge,一种中文语言理解和生成评估基准,具有以下特征:(1)分层基准框架,其中数据集主要选择和组织语言能力 - 任务数据集层次结构。 (2)多级评分策略,其中基于分层框架提供了不同级别的模型性能。为了促进CUGE,我们提供了一个公共排行榜,可以自定义,以支持灵活的模型判断标准。代表性预先训练的语言模型的评估结果表明了对通用语言智能的完善的充足空间。 Cuge在Cuge.baai.ac.cn上公开提供。
translated by 谷歌翻译
尽管近期图形神经网络(GNN)进展,但解释了GNN的预测仍然具有挑战性。现有的解释方法主要专注于后性后解释,其中采用另一种解释模型提供培训的GNN的解释。后HOC方法未能揭示GNN的原始推理过程的事实引发了建立GNN与内置解释性的需求。在这项工作中,我们提出了原型图形神经网络(Protgnn),其将原型学习与GNNS相结合,并提供了对GNN的解释的新视角。在Protgnn中,解释自然地从基于案例的推理过程衍生,并且实际在分类期间使用。通过将输入与潜伏空间中的一些学习原型的输入进行比较来获得ProtGnn的预测。此外,为了更好地解释性和更高的效率,结合了一种新颖的条件子图采样模块,以指示输入图的哪个部分与ProtGnn +中的每个原型最相似。最后,我们在各种数据集中评估我们的方法并进行具体的案例研究。广泛的结果表明,Protgnn和Protgnn +可以提供固有的解释性,同时实现与非可解释对方的准确性有关的准确性。
translated by 谷歌翻译
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
translated by 谷歌翻译
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
translated by 谷歌翻译
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
translated by 谷歌翻译
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
translated by 谷歌翻译
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
translated by 谷歌翻译
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
translated by 谷歌翻译
With the development of technology and sharing economy, Airbnb as a famous short-term rental platform, has become the first choice for many young people to select. The issue of Airbnb's pricing has always been a problem worth studying. While the previous studies achieve promising results, there are exists deficiencies to solve. Such as, (1) the feature attributes of rental are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price combined with the point of interest(POI) around the house. To address the above challenges, we proposes a multi-source information embedding(MSIE) model to predict the rental price of Airbnb. Specifically, we first selects the statistical feature to embed the original rental data. Secondly, we generates the word feature vector and emotional score combination of three different text information to form the text feature embedding. Thirdly, we uses the points of interest(POI) around the rental house information generates a variety of spatial network graphs, and learns the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi source rental representations, and uses the constructed fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
translated by 谷歌翻译